Abstract

We consider the problem of learning deep generative
models from data. We formulate a method that generates
an independent sample via a single feedforward pass
through a multilayer perceptron, as in the recently
proposed generative adversarial networks (Goodfellow
et al., 2014). Training a generative adversarial network,
however, requires careful optimization of a difficult
minimax program. Instead, we utilize a technique from
statistical hypothesis testing known as maximum mean
discrepancy (MMD), which leads to a simple objective that
can be interpreted as matching all orders of statistics
between a dataset and samples from the model, and can be
trained by backpropagation. We further boost the
performance of this approach by combining our generative
network with an auto-encoder network, using MMD to learn
to generate codes that can then be decoded to produce
samples. We show that the combination of these techniques
yields excellent generative models compared to baseline
approaches as measured on MNIST and the Toronto Face
Database.

1. Introduction 

The most visible successes in the area of deep learning have
come from the application of deep models to supervised
learning tasks. Models such as convolutional neural networks
(CNNs) and long short-term memory (LSTM) networks are now
achieving impressive results on a number of tasks such as
object recognition (Krizhevsky et al., 2012; Sermanet et al.,
2014; Szegedy et al., 2014), speech recognition (Graves &
Jaitly, 2014; Hinton et al., 2012a), image caption generation
(Vinyals et al., 2014; Fang et al., 2014; Kiros et al., 2014),
machine translation (Cho et al., 2014; Sutskever et al., 2014),
and more. Despite their successes, one of the main bottlenecks
of the supervised approach is the difficulty in obtaining
enough data to learn abstract features that capture the rich
structure of the data. It is well recognized that a promising
avenue is to use unsupervised learning on unlabelled data,
which is far more plentiful and cheaper to obtain.

A long-standing and inherent problem in unsupervised
learning is defining a good method for evaluation. Generative
models offer the ability to evaluate generalization in the
data space, which can also be qualitatively assessed. In this
work we propose a generative model for unsupervised learning
that we call generative moment matching networks (GMMNs).
GMMNs are generative neural networks that begin with a simple
prior from which it is easy to draw samples. These are
propagated deterministically through the hidden layers of the
network and the output is a sample from the model. Thus, with
GMMNs it is easy to quickly draw independent random samples,
as opposed to expensive MCMC procedures that are necessary in
other models such as Boltzmann machines (Ackley et al., 1985;
Hinton, 2002; Salakhutdinov & Hinton, 2009). The structure of
a GMMN is most analogous to the recently proposed generative
adversarial networks (GANs) (Goodfellow et al., 2014).
However, unlike GANs, whose training involves a difficult
minimax optimization problem, GMMNs are comparatively simple;
they are trained to minimize a straightforward loss function
using backpropagation.

The key idea behind GMMNs is the use of a statistical
hypothesis testing framework called maximum mean discrepancy
(Gretton et al., 2007). Training a GMMN to minimize this
discrepancy can be interpreted as matching all moments of the
model distribution to the empirical data distribution. Using
the kernel trick, MMD can be represented as a simple loss
function that we use as the core training objective for
GMMNs. Using minibatch stochastic gradient descent, training
can be kept efficient, even with large datasets.

As a second contribution, we show how GMMNs can be used to
bootstrap auto-encoder networks in order to further improve
the generative process. The idea behind this approach is to
train an auto-encoder network and then apply a GMMN to the
code space of the auto-encoder. This allows us to leverage the
rich representations learned by auto-encoder models as the
basis for comparing data and model distributions. To generate
samples in the original data space, we simply sample a code
from the GMMN and then use the decoder of the auto-encoder
network.

Our experiments show that this relatively simple, yet very
flexible framework is effective at producing good generative
models in an efficient manner. On MNIST and the Toronto Face
Dataset (TFD) we demonstrate improved results over comparable
baselines, including GANs. Source code for training GMMNs
will be made available at https://github.com/yujiali/gmmn.


2. Maximum Mean Discrepancy 

Suppose we are given two sets of samples $X = \{x_i\}_{i=1}^N$
and $Y = \{y_j\}_{j=1}^M$ and are asked whether the generating
distributions $P_X = P_Y$. Maximum mean discrepancy is a
frequentist estimator for answering this question, also known
as the two sample test (Gretton et al., 2007; 2012a). The idea
is simple: compare statistics between the two datasets and if
they are similar then the samples are likely to come from the
same distribution.


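Formally, the following MMD measure computes the mean
squared difference of the statistics of the two sets of
samples.

$$\mathcal{L}_{\mathrm{MMD}^2} = \left\| \frac{1}{N} \sum_{i=1}^{N} \phi(x_i) - \frac{1}{M} \sum_{j=1}^{M} \phi(y_j) \right\|^2 \tag{1}$$

$$= \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} \phi(x_i)^\top \phi(x_{i'}) - \frac{2}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \phi(x_i)^\top \phi(y_j) + \frac{1}{M^2} \sum_{j=1}^{M} \sum_{j'=1}^{M} \phi(y_j)^\top \phi(y_{j'}) \tag{2}$$

Taking $\phi$ to be the identity function leads to matching the
sample mean, and other choices of $\phi$ can be used to match
higher order moments.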


Written in this form, each term in Equation (2) only involves
inner products between the $\phi$ vectors, and therefore the
kernel trick can be applied.

$$\mathcal{L}_{\mathrm{MMD}^2} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} k(x_i, x_{i'}) - \frac{2}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} k(x_i, y_j) + \frac{1}{M^2} \sum_{j=1}^{M} \sum_{j'=1}^{M} k(y_j, y_{j'}) \tag{3}$$

The kernel trick implicitly lifts the sample vectors into an
infinite dimensional feature space. When this feature space
corresponds to a universal reproducing kernel Hilbert space,
it is shown that asymptotically, MMD is 0 if and only if
$P_X = P_Y$ (Gretton et al., 2007; 2012a).

For universal kernels like the Gaussian kernel, defined as
$k(x, x') = \exp(-\frac{1}{2\sigma} |x - x'|^2)$, where $\sigma$
is the bandwidth parameter, we can use a Taylor expansion to
get an explicit feature map $\phi$ that contains an infinite
number of terms and covers all orders of statistics.
Minimizing MMD under this feature expansion is then equivalent
to minimizing a distance between all moments of the two
distributions.
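
As an illustration of the kernelized estimator described in this section, the following is a minimal sketch in plain Python, not the authors' released code. It assumes the Gaussian kernel is parameterized as $\exp(-\|x - y\|^2 / (2\sigma))$; the function names `gaussian_kernel` and `mmd2`, the sample sizes, and the toy Gaussian data are all hypothetical choices made for this example.

```python
import math
import random

def gaussian_kernel(x, y, sigma):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma)); sigma is the bandwidth (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma))

def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two lists of vectors."""
    n, m = len(xs), len(ys)
    k_xx = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xs) / (n * n)
    k_xy = sum(gaussian_kernel(a, b, sigma) for a in xs for b in ys) / (n * m)
    k_yy = sum(gaussian_kernel(a, b, sigma) for a in ys for b in ys) / (m * m)
    return k_xx - 2.0 * k_xy + k_yy

# Toy data: two draws from the same Gaussian, and one from a shifted Gaussian.
rng = random.Random(0)
same_a = [[rng.gauss(0.0, 1.0)] for _ in range(200)]
same_b = [[rng.gauss(0.0, 1.0)] for _ in range(200)]
shifted = [[rng.gauss(3.0, 1.0)] for _ in range(200)]

print("same distribution:   ", mmd2(same_a, same_b))
print("shifted distribution:", mmd2(same_a, shifted))
```

The estimate for the shifted pair should come out clearly larger than for the matched pair, which is the behaviour the two sample test relies on. The double loops make the quadratic cost in the number of samples explicit; a practical implementation would vectorize them.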

3. Related Work 

In this work we focus on generative models due to their
ability to capture the salient properties and structure of
data. Deep generative models are particularly appealing
because they are capable of learning a latent manifold on
which the data has high density. Learning this manifold allows
smooth variations in the latent space to result in non-trivial
transformations in the original space, effectively traversing
between high density modes through low density areas (Bengio
et al., 2013a). They are also capable of disentangling factors
of variation, which means that each latent variable can become
responsible for modelling a single, complex transformation in
the original space that would otherwise involve many variables
(Bengio et al., 2013a). Even if we restrict ourselves to the
field of deep learning, there is a vast array of approaches to
generative modelling. Below, we outline some of these methods.

One popular class of generative models used in deep learning
is undirected graphical models, such as Boltzmann machines
(Ackley et al., 1985), restricted Boltzmann machines (Hinton,
2002), and deep Boltzmann machines (Salakhutdinov & Hinton,
2009). These models are normalized by a typically intractable
partition function, making training, evaluation, and sampling
extremely difficult, usually requiring expensive Markov-chain
Monte Carlo (MCMC) procedures.

Next there is the class of fully visible directed models such
as fully visible sigmoid belief networks (Neal, 1992) and the
neural autoregressive distribution estimator (Larochelle &
Murray, 2011). These admit efficient log-likelihood
calculation, gradient-based learning and efficient sampling,
but require that an ordering be imposed on the observable
variables, which can be unnatural for domains such as images.
Because of their sequential nature, they also cannot take
advantage of parallel computing methods.

More related to our own work, there is a line of research
devoted to recovering density models from auto-encoder
networks using MCMC procedures (Rifai et al., 2012; Bengio
et al., 2013b; 2014). These attempt to use contraction
operators, or denoising criteria in order to generate a Markov
chain by repeated perturbations during the encoding phase,
followed by decoding.

Also related to our own work, there is the class of deep,
variational networks (Rezende et al., 2014; Kingma & Welling,
2014; Mnih & Gregor, 2014). These are also deep, directed
generative models; however, they make use of an additional
neural network that is designed to approximate the posterior
over the latent variables. Training is carried out via a
variational lower bound on the log-likelihood of the model
distribution. These models are trained using stochastic
gradient descent; however, they either require that the latent
representation is continuous (Kingma & Welling, 2014), or
require many secondary networks to sufficiently reduce the
variance of gradient estimates in order to produce a
sufficiently good learning signal (Mnih & Gregor, 2014).

Finally, there is some early work that proposed the idea of
using feed-forward neural networks to learn generative models.
MacKay (1995) proposed a model that is closely related to
ours, which also used a feed-forward network to map the prior
samples to the data space. However, instead of directly
outputting samples, an extra distribution is associated with
the output. Sampling was used extensively for learning and
inference in this model. Magdon-Ismail & Atiya (1998) proposed
to use a neural network to learn a transformation from the
data space to another space where the transformed data points
are uniformly distributed. This transformation network then
learns the cumulative distribution function.

4. Generative Moment Matching Networks 

4.1. Data Space Networks 

The high-level idea of the GMMN is to use a neural network to
learn a deterministic mapping from samples of a simple,
easy-to-sample distribution to samples from the data
distribution. The architecture of the generative network is
exactly the same as a generative adversarial network
(Goodfellow et al., 2014). However, we propose to train the
network by simply minimizing the MMD criterion, avoiding the
hard minimax objective function used in generative adversarial
network training.

More specifically, in the generative network we have a
stochastic hidden layer $h \in \mathbb{R}^H$ with $H$ hidden
units at the top with a prior uniform distribution on each
unit independently,

$$p(h) = \prod_{j=1}^{H} U(h_j) \tag{4}$$

Here $U(h) = \frac{1}{2} I[-1 \le h \le 1]$ is a uniform
distribution in $[-1, 1]$, where $I[\cdot]$ is an indicator
function. Other choices for the prior are also possible, as
long as it is a simple enough distribution from which we can
easily draw samples.

Figure 1. Example architectures of our generative moment
matching networks. (a) GMMN used in the input data space.
(b) GMMN+AE: GMMN used in the code space of an auto-encoder.

The $h$ vector is then passed through the neural network and
deterministically mapped to a vector $x \in \mathbb{R}^D$ in
the $D$ dimensional data space.

$$x = f(h; w) \tag{5}$$

$f$ is the neural network mapping function, which can contain
multiple layers of nonlinearities, and $w$ represents the
parameters of the neural network. One example architecture for
$f$ is illustrated in Figure 1(a), which has 3 intermediate
ReLU (Nair & Hinton, 2010) nonlinear layers and one logistic
sigmoid output layer.

The prior $p(h)$ and the mapping $f(h; w)$ jointly define a
distribution $p(x)$ in the data space. To generate a sample
$x \sim p(x)$ we only need to sample from the uniform prior
$p(h)$ and then pass the sample $h$ through the neural net to
get $x = f(h; w)$.
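
The sampling procedure of Equations (4) and (5) can be sketched as follows. This is an illustrative plain-Python forward pass under assumed details: the layer sizes, the Gaussian weight initialization, and the helper names (`init_layer`, `sample_prior`, `generate`) are hypothetical, and the weights here are random rather than trained.

```python
import math
import random

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def affine(weights, bias, v):
    """Compute W v + b for a weight matrix stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]

def init_layer(rng, n_in, n_out, scale=0.5):
    """Hypothetical initialization: Gaussian weights, zero biases."""
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def sample_prior(rng, h_dim):
    """Draw h with independent Uniform[-1, 1] components."""
    return [rng.uniform(-1.0, 1.0) for _ in range(h_dim)]

def generate(layers, h):
    """x = f(h; w): ReLU hidden layers followed by a logistic sigmoid output layer."""
    v = h
    for weights, bias in layers[:-1]:
        v = relu(affine(weights, bias, v))
    weights, bias = layers[-1]
    return sigmoid(affine(weights, bias, v))

rng = random.Random(0)
sizes = [4, 8, 8, 8, 6]  # assumed sizes: H = 4, three hidden layers, D = 6
layers = [init_layer(rng, a, b) for a, b in zip(sizes[:-1], sizes[1:])]
x = generate(layers, sample_prior(rng, sizes[0]))
print(len(x), all(0.0 <= xi <= 1.0 for xi in x))
```

Each call to `sample_prior` followed by `generate` yields an independent sample in a single deterministic pass, with no Markov chain involved.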

Goodfellow et al. (2014) proposed to train this network by
using an extra discriminative network, which tries to
distinguish between model samples and data samples. The
generative network is then trained to counteract this in order
to make the samples indistinguishable to the discriminative
network. The gradient of this objective can be backpropagated
through the generative network. However, because of the
minimax nature of the formulation, it is easy to get stuck at
a local optimum, so the training of the generative network and
the discriminative network must be interleaved and carefully
scheduled. By contrast, our learning algorithm simply involves
minimizing the MMD objective.

Assume we have a dataset of training examples
$x_1^d, \ldots, x_N^d$ ($d$ for data), and a set of samples
generated from our model $x_1^s, \ldots, x_M^s$ ($s$ for
samples). The MMD objective $\mathcal{L}_{\mathrm{MMD}^2}$ is
differentiable when the kernel is differentiable. For example,
for Gaussian kernels
$k(x, y) = \exp(-\frac{1}{2\sigma} \|x - y\|^2)$, the gradient
with respect to $x_{ip}^s$, the $p$-th component of sample
$x_i^s$, has a simple form

$$\frac{\partial \mathcal{L}_{\mathrm{MMD}^2}}{\partial x_{ip}^s} = \frac{2}{M^2} \sum_{j=1}^{M} \frac{1}{\sigma} k(x_i^s, x_j^s) (x_{jp}^s - x_{ip}^s) - \frac{2}{MN} \sum_{j=1}^{N} \frac{1}{\sigma} k(x_i^s, x_j^d) (x_{jp}^d - x_{ip}^s) \tag{6}$$

This gradient can then be backpropagated through the
generative network to update the parameters $w$.
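
As a sanity check on the gradient with respect to a model sample, the sketch below compares an analytic expression against central finite differences. It is not taken from the authors' code: the analytic form is obtained by differentiating the kernelized squared MMD under the assumed kernel $\exp(-\|x - y\|^2 / (2\sigma))$, and the function names and toy data are hypothetical.

```python
import math
import random

def kernel(x, y, sigma):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * sigma))

def mmd2(data, samples, sigma):
    n, m = len(data), len(samples)
    dd = sum(kernel(a, b, sigma) for a in data for b in data) / (n * n)
    ds = sum(kernel(a, b, sigma) for a in data for b in samples) / (n * m)
    ss = sum(kernel(a, b, sigma) for a in samples for b in samples) / (m * m)
    return dd - 2.0 * ds + ss

def mmd2_grad_sample(data, samples, i, sigma):
    """Analytic gradient of mmd2 with respect to the i-th model sample."""
    n, m = len(data), len(samples)
    xi = samples[i]
    grad = [0.0] * len(xi)
    for xj in samples:  # sample-sample term, weight 2 / M^2
        c = 2.0 / (m * m) * kernel(xi, xj, sigma) / sigma
        grad = [g + c * (b - a) for g, a, b in zip(grad, xi, xj)]
    for xj in data:     # sample-data cross term, weight -2 / (M N)
        c = 2.0 / (m * n) * kernel(xi, xj, sigma) / sigma
        grad = [g - c * (b - a) for g, a, b in zip(grad, xi, xj)]
    return grad

def numeric_grad(data, samples, i, sigma, eps=1e-6):
    """Central finite differences on each component of the i-th sample."""
    grad = []
    for p in range(len(samples[i])):
        plus = [list(s) for s in samples]
        minus = [list(s) for s in samples]
        plus[i][p] += eps
        minus[i][p] -= eps
        grad.append((mmd2(data, plus, sigma) - mmd2(data, minus, sigma)) / (2 * eps))
    return grad

rng = random.Random(0)
data = [[rng.gauss(1.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(20)]
samples = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(15)]
analytic = mmd2_grad_sample(data, samples, 3, sigma=2.0)
numeric = numeric_grad(data, samples, 3, sigma=2.0)
print(max(abs(a - b) for a, b in zip(analytic, numeric)))
```

The printed discrepancy should be at the level of finite-difference error, which supports the sign and scaling of both terms.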


4.2. Auto-Encoder Code Space Networks

Real-world data can be complicated and high-dimensional,
which is one reason why generative modelling is such a
difficult task. Auto-encoders, on the other hand, are designed
to solve an arguably simpler task of reconstruction. If
trained properly, auto-encoder models can be very good at
representing data in a code space that captures enough
statistical information that the data can be reliably
reconstructed.

The code space of an auto-encoder has several advantages for
creating a generative model. The first is that the
dimensionality can be explicitly controlled. Visual data, for
example, while represented in a high dimension, often exists
on a low-dimensional manifold. This is beneficial for a
statistical estimator like MMD because the amount of data
required to produce a reliable estimator grows with the
dimensionality of the data (Ramdas et al., 2015). The second
advantage is that each dimension of the code space can end up
representing complex variations in the original data space.
This concept is referred to in the literature as disentangling
factors of variation (Bengio et al., 2013a).

For these reasons, we propose to bootstrap auto-encoder
models with a GMMN to create what we refer to as the GMMN+AE
model. These operate by first learning an auto-encoder and
producing code representations of the data, then freezing the
auto-encoder weights and learning a GMMN to minimize MMD
between generated codes and data codes. A visualization of
this model is given in Figure 1(b).

Our method for training a GMMN+AE proceeds as follows:

1. Greedy layer-wise pretraining of the auto-encoder
(Bengio et al., 2007).

2. Fine-tune the auto-encoder.

3. Train a GMMN to model the code layer distribution
using an MMD objective on the final encoding layer.

We found that adding dropout to the encoding layers can be
beneficial in terms of creating a smooth manifold in code
space. This is analogous to the motivation behind contractive
and denoising auto-encoders (Rifai et al., 2011; Vincent
et al., 2008).

4.3. Practical Considerations

Here we outline some design choices that we have found to
improve the performance of GMMNs.

Bandwidth Parameter. The bandwidth parameter in the kernel
plays a crucial role in determining the statistical efficiency
of MMD, and optimally setting it is an open problem. A good
heuristic is to perform a line search to obtain the bandwidth
that produces the maximal distance (Sriperumbudur et al.,
2009); other more advanced heuristics are also available
(Gretton et al., 2012b). As a simpler approximation, for most
of our experiments we use a mixture of $K$ kernels spanning
multiple ranges. That is, we choose the kernel to be:

$$k(x, x') = \sum_{q=1}^{K} k_{\sigma_q}(x, x') \tag{7}$$

where $k_{\sigma_q}$ is a Gaussian kernel with bandwidth
parameter $\sigma_q$. We found that choosing simple values for
these such as 1, 5, 10, etc. and using a mixture of 5 or more
was sufficient to obtain good results. The weighting of
different kernels can be further tuned to achieve better
results, but we kept them equally weighted for simplicity.
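
The equally weighted kernel mixture of Equation (7) can be sketched as below. The bandwidths 1, 5 and 10 follow the values mentioned in the text; 20 and 40 are hypothetical continuations of "etc." chosen only to reach five kernels, and the kernel parameterization $\exp(-\|x - y\|^2 / (2\sigma))$ is assumed.

```python
import math

def gaussian_kernel(x, y, sigma):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma))

# 1, 5, 10 come from the text; 20 and 40 are assumed values for illustration.
SIGMAS = (1.0, 5.0, 10.0, 20.0, 40.0)

def mixture_kernel(x, y, sigmas=SIGMAS):
    """Equally weighted sum of Gaussian kernels with different bandwidths."""
    return sum(gaussian_kernel(x, y, s) for s in sigmas)

near, far = [0.0, 0.0], [6.0, 8.0]  # squared distance 100
print("single kernel, sigma = 1:", gaussian_kernel(near, far, 1.0))
print("mixture of five kernels: ", mixture_kernel(near, far))
```

For points that are far apart, the narrow kernel alone is numerically negligible, so it would contribute almost no gradient signal, while the wider components of the mixture remain informative. This is one way to read the phrase "spanning multiple ranges".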

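Square Root Loss. In practice, we have found that better
results can be obtained by optimizing
$\mathcal{L}_{\mathrm{MMD}} = \sqrt{\mathcal{L}_{\mathrm{MMD}^2}}$.
This loss can be important for driving the difference between
the two distributions as close to 0 as possible. Compared to
$\mathcal{L}_{\mathrm{MMD}^2}$, which flattens out when its
value gets close to 0, $\mathcal{L}_{\mathrm{MMD}}$ behaves
much better for small $\mathcal{L}_{\mathrm{MMD}}$ values.
Alternatively, this can be understood by writing down the
gradient of $\mathcal{L}_{\mathrm{MMD}}$ with respect to $w$

$$\frac{\partial \mathcal{L}_{\mathrm{MMD}}}{\partial w} = \frac{1}{2 \sqrt{\mathcal{L}_{\mathrm{MMD}^2}}} \frac{\partial \mathcal{L}_{\mathrm{MMD}^2}}{\partial w}$$

The $1 / (2 \sqrt{\mathcal{L}_{\mathrm{MMD}^2}})$ term
automatically adapts the effective learning rate. This is
especially beneficial when both $\mathcal{L}_{\mathrm{MMD}^2}$
and $\partial \mathcal{L}_{\mathrm{MMD}^2} / \partial w$ become
small, where this extra factor can help by maintaining larger
gradients.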
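
To make the adaptive factor concrete, the short sketch below evaluates the square root loss and the multiplier $1 / (2\sqrt{\cdot})$ for a few values of the squared MMD. The helper name and the small floor `eps` are assumptions of this example; the floor only guards against division by zero and against tiny negative values caused by floating-point rounding.

```python
import math

def sqrt_mmd_loss_and_scale(mmd2_value, eps=1e-12):
    """Return sqrt(MMD^2) and the factor 1 / (2 * sqrt(MMD^2)) that rescales its gradient.

    eps is a hypothetical numerical floor, not part of the method as described.
    """
    root = math.sqrt(max(mmd2_value, eps))
    return root, 1.0 / (2.0 * root)

for value in (1.0, 1e-2, 1e-4):
    loss, scale = sqrt_mmd_loss_and_scale(value)
    print(value, loss, scale)
```

As the squared MMD shrinks, the multiplier grows, which is the sense in which the square root loss keeps gradients from vanishing near the optimum.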

Minibatch Training. One of the issues with MMD is that the
usage of kernels means that the computation of the objective
scales quadratically with the amount of data. In the
literature there have been several alternative estimators
designed to overcome this (Gretton et al., 2012a). In our
case, we found that it was sufficient to optimize MMD using
minibatch optimization. In each weight update, a small subset
of data is chosen, and an equal number of samples are drawn
from the GMMN. Within a minibatch, MMD is applied as usual. As
we are using exact samples from the model and the data
distribution, the minibatch MMD is still a good estimator of
the population MMD. We found this approach to be both fast and
effective. The minibatch training algorithm for GMMN is shown
in Algorithm 1.

Algorithm 1: GMMN minibatch training

Input:  Dataset $\{x_1^d, \ldots, x_N^d\}$, prior $p(h)$,
        network $f(h; w)$ with initial parameter $w^{(0)}$
Output: Learned parameter $w^*$

while stopping criterion not met do
    Get a minibatch of data $X^d \leftarrow \{x_{i_1}^d, \ldots, x_{i_b}^d\}$
    Get a new set of samples $X^s \leftarrow \{x_1^s, \ldots, x_b^s\}$
    Compute the gradient of the MMD objective with respect to $w$ on $X^d$ and $X^s$
    Take a gradient step to update $w$
end
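
Algorithm 1 can be illustrated on a deliberately tiny problem. The sketch below is a toy under stated assumptions rather than the experimental setup: the data are one-dimensional draws from Uniform[2, 4], the "network" is a single linear map $x = a h + b$, the loss is the squared MMD with one Gaussian kernel, and the bandwidth, learning rate, batch size and step count are all hypothetical. The gradient with respect to each sample is computed analytically and passed to the parameters by the chain rule.

```python
import math
import random

SIGMA = 2.0  # assumed bandwidth for this toy problem

def kernel(x, y):
    return math.exp(-((x - y) ** 2) / (2.0 * SIGMA))

def mmd2(data, samples):
    n, m = len(data), len(samples)
    dd = sum(kernel(a, b) for a in data for b in data) / (n * n)
    ds = sum(kernel(a, b) for a in data for b in samples) / (n * m)
    ss = sum(kernel(a, b) for a in samples for b in samples) / (m * m)
    return dd - 2.0 * ds + ss

def grad_wrt_samples(data, samples):
    """Gradient of mmd2 with respect to each scalar model sample."""
    n, m = len(data), len(samples)
    grads = []
    for xi in samples:
        g = sum(2.0 / (m * m) * kernel(xi, xj) * (xj - xi) / SIGMA for xj in samples)
        g -= sum(2.0 / (m * n) * kernel(xi, dj) * (dj - xi) / SIGMA for dj in data)
        grads.append(g)
    return grads

def train(dataset, steps=150, batch=32, lr=0.5, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator x = a * h + b, a one-layer linear "network"
    for _ in range(steps):
        x_d = rng.sample(dataset, batch)                     # minibatch of data
        hs = [rng.uniform(-1.0, 1.0) for _ in range(batch)]  # fresh prior draws
        x_s = [a * h + b for h in hs]                        # fresh model samples
        g = grad_wrt_samples(x_d, x_s)
        # Chain rule through x = a * h + b gives the parameter gradients.
        grad_a = sum(gi * h for gi, h in zip(g, hs))
        grad_b = sum(g)
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

rng = random.Random(1)
dataset = [rng.uniform(2.0, 4.0) for _ in range(1000)]
a, b = train(dataset)
h_eval = [rng.uniform(-1.0, 1.0) for _ in range(200)]
before = mmd2(dataset[:200], h_eval)                       # initial generator: a = 1, b = 0
after = mmd2(dataset[:200], [a * h + b for h in h_eval])   # trained generator
print("MMD^2 before:", before, "after:", after, "a:", a, "b:", b)
```

Each step draws new model samples, as the algorithm requires, so the cost per update is quadratic only in the minibatch size and not in the dataset size.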

5. Experiments 

We trained GMMNs on two benchmark datasets: MNIST (LeCun
et al., 1998) and the Toronto Face Dataset (TFD) (Susskind
et al., 2010). For MNIST, we used the standard test set of
10,000 images, and split out 5,000 from the standard 60,000
training images for validation. The remaining 55,000 were used
for training. For TFD, we used the same training and test sets
and fold splits as used by Goodfellow et al. (2014), but split
out a small set of the training data and used it as the
validation set. For both datasets, rescaling the images to
have pixel intensities between 0 and 1 is the only
preprocessing step we did.

On both datasets, we trained the GMMN network in both the
input data space and the code space of an auto-encoder. For
all the networks we used in this section, a uniform
distribution in $[-1, 1]^H$ was used as the prior for the
$H$-dimensional stochastic hidden layer at the top of the
GMMN, which was followed by 4 ReLU layers, and the output was
a layer of logistic sigmoid units. The auto-encoder we used
for MNIST had 4 layers, 2 for the encoder and 2 for the
decoder. For TFD the auto-encoder had 6 layers in total, 3 for
the encoder and 3 for the decoder. For both auto-encoders the
encoder and the decoder had mirrored architectures. All layers
in the auto-encoder network used sigmoid nonlinearities, which
also guaranteed that the code space dimensions lay in
$[0, 1]$, so that they could match the GMMN outputs. The
network architectures for MNIST are shown in Figure 1.

Model              MNIST        TFD
DBN                138 ± 2      1909 ± 66
Stacked CAE        121 ± 1.6    2110 ± 50
Deep GSN           214 ± 1.1    1890 ± 29
Adversarial nets   225 ± 2      2057 ± 26
GMMN               147 ± 2      2085 ± 25
GMMN+AE            282 ± 2      2204 ± 20

Table 1. Log-likelihood of the test sets under different
models. The baselines are Deep Belief Net (DBN) and Stacked
Contractive Auto-Encoder (Stacked CAE) from (Bengio et al.,
2013a), Deep Generative Stochastic Network (Deep GSN) from
(Bengio et al., 2014) and Adversarial nets (GANs) from
(Goodfellow et al., 2014).

The auto-encoders were trained separately from the GMMN.
Cross entropy was used as the reconstruction loss. We first
did standard layer-wise pretraining, then fine-tuned all
layers jointly. Dropout (Hinton et al., 2012b) was used on the
encoder layers. After training the auto-encoder, we fixed it
and passed the input data through the encoder to get the
corresponding codes. The GMMN network was then trained in this
code space to match the statistics of generated codes to the
statistics of codes from data examples. When generating
samples, the generated codes were passed through the decoder
to get samples in the input data space.

For all experiments in this section the GMMN networks were
trained with minibatches of size 1000; for each minibatch we
generated a set of 1000 samples from the network. The loss and
gradient were computed from these 2000 points. We used the
square root loss function $\mathcal{L}_{\mathrm{MMD}}$
throughout.

Evaluation of our model is not straightforward: because we do
not have an explicit form for the probability density
function, it is not easy to compute the log-likelihood of
data. However, sampling from our model is easy. We therefore
followed the same evaluation protocol used in related models
(Bengio et al., 2013a; Bengio et al., 2014; Goodfellow et al.,
2014). A Gaussian Parzen window (kernel density estimator) was
fit to 10,000 samples generated from the model. The likelihood
of the test data was then computed under this distribution.
The scale parameter of the Gaussians was selected using a grid
search in a fixed range using the validation set.
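
The evaluation protocol above can be sketched as follows. This is a generic Gaussian Parzen window estimator, not the evaluation script used for the reported numbers: reporting the average log-density per test point, the candidate grid of scales, and the toy one-dimensional data are assumptions of this example.

```python
import math
import random

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def parzen_log_likelihood(test_points, samples, scale):
    """Mean log-density of test points under isotropic Gaussians centred on the samples."""
    dim = len(samples[0])
    log_norm = -0.5 * dim * math.log(2.0 * math.pi * scale ** 2)
    total = 0.0
    for x in test_points:
        exponents = [
            -sum((a - b) ** 2 for a, b in zip(x, s)) / (2.0 * scale ** 2)
            for s in samples
        ]
        total += log_norm + log_mean_exp(exponents)
    return total / len(test_points)

def select_scale(valid_points, samples, grid):
    """Grid search: pick the scale with the highest validation log-likelihood."""
    return max(grid, key=lambda s: parzen_log_likelihood(valid_points, samples, s))

# Toy stand-ins: "model samples", a validation set and a test set from one Gaussian.
rng = random.Random(0)
samples = [[rng.gauss(0.0, 1.0)] for _ in range(500)]
valid = [[rng.gauss(0.0, 1.0)] for _ in range(100)]
test = [[rng.gauss(0.0, 1.0)] for _ in range(100)]
best = select_scale(valid, samples, grid=[0.05, 0.1, 0.2, 0.5, 1.0])
print(best, parzen_log_likelihood(test, samples, best))
```

The log-mean-exp step matters in high dimensions, where individual Gaussian densities underflow long before their average does.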

The hyperparameters of the networks, including the learning
rate and momentum for both auto-encoder and GMMN training,
dropout rate for the auto-encoder, and number of hidden units
on each layer of both auto-encoder and GMMN, were tuned using
Bayesian optimization (Snoek et al., 2012; 2014)¹ to optimize
the validation set likelihood under the Gaussian Parzen window
density estimation.

¹ We used the service provided by https://www.whetlab.com

The test set log-likelihoods for both datasets are shown in
Table 1. The GMMN is competitive with other approaches, while
the GMMN+AE significantly outperforms the other models. This
shows that despite being relatively simple, MMD, especially
when combined with an effective decoder, is a powerful
objective for training good generative models.

Figure 2. Independent samples and their nearest neighbors in
the training set for the GMMN+AE model trained on MNIST and
TFD datasets. Panels: (a) GMMN MNIST samples; (b) GMMN TFD
samples; (c) GMMN+AE MNIST samples; (d) GMMN+AE TFD samples;
(e) GMMN nearest neighbors for MNIST samples; (f) GMMN+AE
nearest neighbors for MNIST samples; (g) GMMN nearest
neighbors for TFD samples; (h) GMMN+AE nearest neighbors for
TFD samples. For (e), (f), (g) and (h) the top row shows the
samples from the model and the bottom row shows the
corresponding nearest neighbors from the training set measured
by Euclidean distance.

Some samples generated from the GMMN models are shown in
Figure 2(a-d). The GMMN+AE produces the most visually
appealing samples, which is reflected in its Parzen window
log-likelihood estimates. The likely explanation is that any
perturbations in the code space correspond to smooth
transformations along the manifold of the data space. In that
sense, the decoder is able to “correct” noise in the code
space.

To determine whether the models learned to merely copy the
data, we follow the example of Goodfellow et al. (2014) and
visualize the nearest neighbour of several samples in terms of
Euclidean pixel-wise distance in Figure 2(e-h). By this
metric, it appears as though the samples are not merely data
examples.

One of the interesting aspects of a deep generative model
such as the GMMN is that it is possible to directly explore
the data manifold. Using the GMMN+AE model, we randomly
sampled 5 points in the uniform space and show their
corresponding data space projections in Figure 3. These points
are highlighted by red boxes. From left to right, top to
bottom we linearly interpolate between these points in the
uniform space and show their corresponding projections in data
space. The manifold is smooth for the most part, and almost
all of the projections correspond to realistic looking data.
For TFD in particular, these transformations involve complex
attributes, such as the changing of pose, expression,
lighting, gender, and facial hair.
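
The interpolation procedure in the uniform space can be sketched as below. Only the prior-space path is constructed here; projecting each point into data space would additionally require the trained GMMN and decoder, which are not reproduced. The function names, the prior dimension, and the number of steps per leg are hypothetical.

```python
import random

def interpolate(h_start, h_end, steps):
    """Return `steps` evenly spaced points on the segment from h_start to h_end, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for t in range(steps):
        alpha = t / (steps - 1)
        path.append([(1.0 - alpha) * a + alpha * b for a, b in zip(h_start, h_end)])
    return path

def interpolation_tour(anchors, steps):
    """Chain the legs anchor_0 -> anchor_1 -> ... -> anchor_0, closing the loop."""
    tour = []
    for start, end in zip(anchors, anchors[1:] + anchors[:1]):
        # Drop each leg's final point because it begins the next leg.
        tour.extend(interpolate(start, end, steps)[:-1])
    return tour

rng = random.Random(0)
anchors = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(5)]
tour = interpolation_tour(anchors, steps=4)
print(len(tour))
```

Because the prior support $[-1, 1]^H$ is convex, every interpolated point is itself a valid prior sample, so the network is never evaluated outside the region it was trained on.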

6. Conclusion and Future Work 

In this paper we provide a simple and effective framework for
training deep generative models called generative moment
matching networks. Our approach is based on optimizing
maximum mean discrepancy so that samples generated from the
model are indistinguishable from data examples in terms of
their moment statistics. As is standard with MMD, the use of
the kernel trick allows a GMMN to avoid explicitly computing
these moments, resulting in a simple training objective, and
the use of minibatch stochastic gradient descent allows the
training to scale to large datasets.

Our second contribution combines MMD with auto-encoders for
learning a generative model of the code layer. The code
samples from the model can then be fed through the decoder in
order to generate samples in the original space. The use of
auto-encoders makes the generative model learning a much
simpler problem. Combined with MMD, pretrained auto-encoders
can be readily bootstrapped into a good generative model of
data. On MNIST and the Toronto Face Database, the GMMN+AE
model achieves superior performance compared to other
approaches. For these datasets, we demonstrate that the
GMMN+AE is able to discover the implicit manifold of the data.

Figure 3. Linear interpolation between 5 uniform random points
from the GMMN+AE prior projected through the network into data
space for (a) MNIST and (b) TFD. The 5 random points are
highlighted with red boxes, and the interpolation goes from
left to right, top to bottom. The final two rows represent an
interpolation between the last highlighted image and the first
highlighted image.

There are many interesting directions for research using MMD.
One such extension is to consider alternatives to the standard
MMD criterion in order to speed up training. One possibility
is the class of linear-time estimators that has been developed
recently in the literature (Gretton et al., 2012a).

Another possibility is to utilize random features (Rahimi &
Recht, 2007). These are randomized feature expansions whose
inner product converges to a kernel function with an
increasing number of features. This idea was recently explored
for MMD in (Zhao & Meng, 2014). The advantage of this approach
would be that the cost would no longer grow quadratically with
minibatch size because we could use the original objective
given in Equation 2. Another advantage of this approach is
that the data statistics could be pre-computed from the entire
dataset, which would reduce the variance of the objective
gradients.
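
The random feature idea can be sketched as below. This is a generic random Fourier feature construction for a Gaussian kernel, offered as an illustration of the future direction rather than something evaluated in this paper. It assumes the kernel form $\exp(-\|x - y\|^2 / (2\sigma))$, for which frequencies are drawn with variance $1/\sigma$; the function names and feature count are hypothetical.

```python
import math
import random

def make_random_features(dim, num_features, sigma, seed=0):
    """Build z(x) whose inner products approximate exp(-||x - y||^2 / (2 * sigma))."""
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(sigma)
    freqs = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    norm = math.sqrt(2.0 / num_features)

    def features(x):
        return [norm * math.cos(sum(w * a for w, a in zip(row, x)) + b)
                for row, b in zip(freqs, phases)]

    return features

def mean_embedding(points, features):
    """Average feature vector of a set of points; cost is linear in the set size."""
    feats = [features(p) for p in points]
    return [sum(col) / len(points) for col in zip(*feats)]

def mmd2_random_features(xs, ys, features):
    """Squared distance between mean embeddings, using explicit features."""
    mx, my = mean_embedding(xs, features), mean_embedding(ys, features)
    return sum((a - b) ** 2 for a, b in zip(mx, my))

features = make_random_features(dim=2, num_features=2000, sigma=1.0)
x, y = [0.3, -0.2], [1.0, 0.5]
approx = sum(a * b for a, b in zip(features(x), features(y)))
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2.0)
print(approx, exact)
```

Since `mean_embedding` depends on only one set of points, the data-side embedding could be computed once over the whole dataset and reused at every update, which is the variance-reduction benefit mentioned above.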

Another direction we would like to explore is joint training
of the auto-encoder model with the GMMN. Currently, these are
treated separately, but joint training may encourage the
learning of codes that are suitable for both reconstruction
and generation.

While a GMMN provides an easy way to sample data, the
posterior distribution over the latent variables is not
readily available. It would be interesting to explore ways in
which to infer the posterior distribution over the latent
space. A straightforward way to do this is to learn a neural
network to predict the latent vector given a sample. This is
reminiscent of the recognition models used in the wake-sleep
algorithm (Hinton et al., 1995), or variational auto-encoders
(Kingma & Welling, 2014).

An interesting application of MMD that is not directly
related to generative modelling comes from recent work on
learning fair representations (Zemel et al., 2013). There, the
objective is to train a prediction method that is invariant to
a particular sensitive attribute of the data. Their solution
is to learn an intermediate clustering-based representation.
MMD could instead be applied to learn a more powerful,
distributed representation such that the statistics of the
representation do not change conditioned on the sensitive
variable. This idea can be further generalized to learn
representations invariant to known biases.

Finally, the notion of utilizing an auto-encoder with the
GMMN+AE model provides new avenues for creating generative
models of even more complex datasets. For example, it may be
possible to use a GMMN+AE with convolutional auto-encoders
(Zeiler et al., 2010; Masci et al., 2011; Makhzani & Frey,
2014) in order to create generative models of high resolution
color images.